library(rmarkdown)
library(ggplot2)
library(dplyr)
library(tidyr)
library(Hmisc)

library(summarytools)
library(arsenal)

library(gridExtra)

library(GGally)
library(VIM)

library(kableExtra)
library(pixiedust)

1 Housekeeping

Create a shortcut to the folder where you want your working directory to be. This will be useful later when saving your work.
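A minimal sketch of such a shortcut, assuming a hypothetical project folder. The object name `wd`, the folder name, and the file name are illustrative only, and `tempdir()` is used so that the example runs on any machine.

```r
# Hypothetical project folder; replace with your own path.
# tempdir() is used here only so that the sketch runs anywhere.
wd <- file.path(tempdir(), "ida_nhanes")
dir.create(wd, showWarnings = FALSE, recursive = TRUE)

# Build file names relative to the shortcut when saving work
out_file <- file.path(wd, "example_output.csv")
write.csv(data.frame(x = 1:3), out_file, row.names = FALSE)
file.exists(out_file)
```

Building paths with `file.path()` keeps the code portable across operating systems and avoids changing the working directory in the middle of a script.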


2 Data set

The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey examines a nationally representative sample of about 5,000 persons each year. Link to CDC NHANES website

The NHANES study contains objectively measured physical activity data collected using hip-worn accelerometers from multiple cohorts. High quality processed activity data combined with mortality and demographic information can be downloaded and used in R with code from Andrew Leroux (https://andrew-leroux.github.io/rnhanesdata/articles).

Here, we look at the subset of participants between 50 and 85 years old from the 2003-2004 and 2005-2006 samples (n=2978) who wore a hip-worn accelerometer in the free-living environment for up to 7 days. The respondent ID is seqn.

2.0.1 Sociodemographic variables

  • age at examination [years] (i.e. when participants wore the device), age
  • gender (male and female), gender
  • race/ethnicity (non-Hispanic white, non-Hispanic black, Mexican American, and other), race
  • education (< high school graduate, high school graduate/general educational development [GED], some college, and college graduate), educationadult
  • 5 year mortality, NAs for individuals with follow up less than 5 years and alive, yr5.mort
  • Person Months of Follow-up from MEC/Exam Date, permth.exm (follow-up time in this cohort in years = permth.exm/12)
  • final mortality status, mortstat
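The two derived quantities in this list can be sketched on toy data. This assumes that `mortstat` is coded 1 = deceased and 0 = alive, and that "5 years" means 60 person-months; both are assumptions, and the exact rule used to build `yr5.mort` in the processed data may differ at the boundary.

```r
# Toy data (hypothetical values), using the variable names from the list above
toy <- data.frame(
  permth.exm = c(135, 24, 40, 70),  # months of follow-up from exam date
  mortstat   = c(0, 1, 0, 1)        # assumed coding: 1 = deceased, 0 = alive
)

# Follow-up time in years
toy$followup.yrs <- toy$permth.exm / 12

# 5 year mortality: 1 if died within 60 months, 0 if followed for at least
# 60 months, NA if alive but followed for less than 5 years
toy$yr5.mort <- ifelse(toy$mortstat == 1 & toy$permth.exm <= 60, 1,
                ifelse(toy$permth.exm >= 60, 0, NA))
toy
```

The fourth toy participant died after 70 months and is therefore counted as a 5 year survivor.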

2.0.2 Health and behavior variables

  • smoking status (current smoker, former smoker [those reporting quitting within the previous 6 months], and nonsmoker), smokecigs
  • alcohol consumption, drinkstatus
  • BMI, bmi
  • obesity, bmi.cat
  • diabetes, diabetes
  • congestive heart failure, chf
  • cancer, cancer
  • stroke, stroke
  • average systolic blood pressure using the 4 measurements per participant, sys
  • Total cholesterol (mg/dL), lbxtc
  • HDL cholesterol, lbdhdd
  • mobility problem (“No Difficulty”, “Any Difficulty”), mobilityproblem
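As an illustration of the average systolic blood pressure, the sketch below uses the first six values of `bpxsy1` to `bpxsy4` from the `str()` output in the data cleaning section. Averaging the available readings and rounding to an integer is an assumption about how `sys` was derived, although it reproduces the first six values of `sys` shown there.

```r
# First six participants' systolic readings, copied from the str() output
# in the data cleaning section of this document
bp <- data.frame(
  bpxsy1 = c(124, 128, 126, 154, 118, 130),
  bpxsy2 = c(124, 138, 120, NA, 110, NA),
  bpxsy3 = c(112, 134, 122, NA, 116, 132),
  bpxsy4 = rep(NA_real_, 6)
)

# Average the available readings per participant, ignoring missing ones,
# and round to whole mmHg (the rounding is an assumption)
bp$sys <- round(rowMeans(bp, na.rm = TRUE))
bp$sys
# 120 133 123 154 115 131
```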

2.0.3 Physical activity data

  • total activity counts per day (TAC/d)
  • total log activity count (TLAC)
  • total minutes of moderate/vigorous physical activity (MVPA)
  • total accelerometer wear time (WT)
  • sedentary/sleep/non-wear to active transition probability (SATP)
  • active to sedentary/sleep/non-wear transition probability (ASTP): Bout length was defined as the number of consecutive minutes spent in either an active or sedentary state and a daily activity profile was created for each participant to detect alternating bouts of sedentary and active states. ASTP was defined as the probability of transitioning from an active to a sedentary state and calculated as the reciprocal of the average active bout duration. ASTP was calculated for each day and averaged across valid days to derive a single measure of ASTP for each participant.
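The ASTP definition above can be sketched with run-length encoding. The minute-level state sequence and the values for the additional days are made up, and the handling of edge cases in the processed NHANES data may differ.

```r
# Hypothetical minute-level states for one day: 1 = active, 0 = sedentary
state <- c(0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0)

# Run-length encoding gives the alternating bouts and their lengths in minutes
bouts <- rle(state)
active_bouts <- bouts$lengths[bouts$values == 1]   # lengths 3, 1, 2

# ASTP for this day: reciprocal of the mean active bout duration
astp_day <- 1 / mean(active_bouts)                 # 1 / 2 = 0.5

# With several valid days, the participant-level ASTP is the mean over days
astp_days <- c(astp_day, 0.25, 0.30)  # second and third values are made up
astp <- mean(astp_days)
astp
```

A higher ASTP means shorter active bouts on average, that is, more fragmented activity.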

In addition, there are total log activity count summary measures (tlac.1, tlac.2, …, tlac.12) in each 2-hr window, i.e. 12AM-2AM, 2AM-4AM, 4AM-6AM, etc.
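A sketch of how such window summaries could be computed from minute-level counts. It assumes that the log activity count for a minute is log(1 + count) and that TLAC is the sum over minutes; the simulated counts are hypothetical.

```r
set.seed(1)
# Hypothetical minute-level activity counts for one day (1440 minutes)
counts <- rpois(1440, lambda = 50)

# Assumed definition: log-transform each minute as log(1 + count)
log_counts <- log1p(counts)

# Label each minute with its 2-hour window: 1 = 12AM-2AM, ..., 12 = 10PM-12AM
window <- rep(1:12, each = 120)

# Sum the log counts within each window
tlac_windows <- tapply(log_counts, window, sum)
names(tlac_windows) <- paste0("tlac.", 1:12)

# The 12 window totals add up to the daily total
tlac <- sum(log_counts)
all.equal(sum(tlac_windows), tlac)
```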


3 Framework for a systematic initial data analysis (IDA)

The main aim of IDA is to provide reliable knowledge about the data to enable responsible statistical analyses and interpretation.

This framework was developed for a primary data collection where data are obtained to address a predefined set of research questions, with an elaborated analysis plan. However, IDA is often performed in more complex studies, which raises additional issues such as implementing IDA processes during ongoing data collection to detect data issues while they can still be remedied.

3.0.1 Intended statistical analysis

Develop a predictive model for 5 year mortality risk in the NHANES cohort using demographic data, comorbidities, lifestyle factors, and physical activity measurements.

In the NHANES example the theoretical research objective could be to find predictors of 5-year mortality.

3.0.2 Six elements of IDA

  1. Meta data
  2. Data cleaning
  3. Data screening
  4. Initial IDA report
  5. Updating analysis plan
  6. IDA report for manuscript

Figure: IDA Framework

Metadata setup summarizes background information to properly conduct all following IDA steps. Beyond technical metadata such as labels or plausibility limits, this covers conceptual metadata which combines information from the study protocol, secondary information sources and information about the actual study conduct.

Data cleaning is performed to identify and correct technical data errors. Many errors may not be directly observed, and a proper metadata setup is crucial to progress correctly and efficiently in this step.

Data screening examines data properties to inform decisions about the realizability of the intended analyses. In contrast to the data cleaning step, the focus is on data properties, not technical errors. However, data screening may reveal structural errors that occurred during the data collection process.

Initial data reporting documents all insights obtained from the previous steps to the research body.

Refining and updating the analysis plan accounts for findings from the previous IDA steps by adapting the plan where needed.

Reporting IDA in research papers is necessary to ensure transparency regarding key findings and actions in the IDA steps that impacted the analysis or interpretation of results. This reporting step is based on the initial data reporting but is clearly focused on the specific paper and what has been done, whereas the former provides a general overview of IDA findings and suggestions on ways to handle potential conflicts with the analysis plan.


4 Meta data

Meta data may include the study information, study protocol, and a data dictionary.

Add missing labels for physical activity data:

## Input object size:    1638080 bytes;  80 variables    2978 observations
## New object size: 1486584 bytes;  80 variables    2978 observations
## 
## Data frame:nhanesdat 2978 observations and 80 variables    Maximum # NAs:2401
## 
## 
##                                                                        Labels
## seqn                                                                         
## paxcal                                                                       
## paxstat                                                                      
## weekday                                                                      
## sddsrvyr                                                                     
## eligstat                                                                     
## mortstat                                                                     
## permth.exm                                                                   
## permth.int                                                                   
## ucod.leading                                                                 
## diabetes.mcod                                                                
## hyperten.mcod                                                                
## sdmvpsu                                                                      
## sdmvstra                                                                     
## wtint2yr                                                                     
## wtmec2yr                                                                     
## ridagemn                                                                     
## ridageex                                                                     
## ridageyr                                                                     
## bmi                                                                          
## bmi.cat                                                                      
## race                                                                         
## gender                                                                       
## diabetes                                                                     
## chf                                                                          
## chd                                                                          
## cancer                                                                       
## stroke                                                                       
## educationadult                                                               
## mobilityproblem                                                              
## drinkstatus                                                                  
## drinksperweek                                                                
## smokecigs                                                                    
## bpxsy1                                                                       
## bpxsy2                                                                       
## bpxsy3                                                                       
## bpxsy4                                                                       
## lbxtc                                                                        
## lbdhdd                                                                       
## yr5.mort                                                                     
## age                                                                          
## sys                                                                          
## exclude                                                                      
## wtint2yr.unadj                                                               
## wtmec2yr.unadj                                                               
## wtint2yr.unadj.norm                                                          
## wtmec2yr.unadj.norm                                                          
## wtint4yr.unadj                                                               
## wtint4yr.unadj.norm                                                          
## wtmec4yr.unadj                                                               
## wtmec4yr.unadj.norm                                                          
## wtint2yr.adj                                                                 
## wtint2yr.adj.norm                                                            
## wtmec2yr.adj                                                                 
## wtmec2yr.adj.norm                                                            
## wtint4yr.adj                                                                 
## wtint4yr.adj.norm                                                            
## wtmec4yr.adj                                                                 
## wtmec4yr.adj.norm                                                            
## tac                                             total activity counts per day
## tlac                                                 total log activity count
## wt                                              total accelerometer wear time
## st                                                                           
## mvpa                     total minutes of moderate/vigorous physical activity
## about                                                                        
## sbout                                                                        
## satp                sedentary/sleep/non-wear to active transition probability
## astp                active to sedentary/sleep/non-wear transition probability
## tlac.1                                                                       
## tlac.2                                                                       
## tlac.3                                                                       
## tlac.4                                                                       
## tlac.5                                                                       
## tlac.6                                                                       
## tlac.7                                                                       
## tlac.8                                                                       
## tlac.9                                                                       
## tlac.10                                                                      
## tlac.11                                                                      
## tlac.12                                                                      
##                     Levels   Class   Storage  NAs
## seqn                                 integer    0
## paxcal                               integer    0
## paxstat                              integer    0
## weekday                              integer    0
## sddsrvyr                             integer    0
## eligstat                             integer    0
## mortstat                             integer    0
## permth.exm                           integer    0
## permth.int                           integer    0
## ucod.leading                       character 2195
## diabetes.mcod                        integer 2195
## hyperten.mcod                        integer 2195
## sdmvpsu                              integer    0
## sdmvstra                             integer    0
## wtint2yr                              double    0
## wtmec2yr                              double    0
## ridagemn                             integer    0
## ridageex                             integer    0
## ridageyr                             integer    0
## bmi                                   double    0
## bmi.cat                  4           integer    0
## race                     5           integer    0
## gender                   2           integer    0
## diabetes                 2           integer    0
## chf                      2           integer    0
## chd                      2           integer    0
## cancer                   2           integer    0
## stroke                   2           integer    0
## educationadult           3           integer    0
## mobilityproblem          2           integer    0
## drinkstatus              4           integer    0
## drinksperweek                         double   86
## smokecigs                3           integer    0
## bpxsy1                               integer  347
## bpxsy2                               integer  486
## bpxsy3                               integer  523
## bpxsy4                               integer 2401
## lbxtc                                integer    0
## lbdhdd                               integer    0
## yr5.mort                             integer    0
## age                                   double    0
## sys                                  integer    0
## exclude                              integer    0
## wtint2yr.unadj                        double    0
## wtmec2yr.unadj                        double    0
## wtint2yr.unadj.norm                   double    0
## wtmec2yr.unadj.norm                   double    0
## wtint4yr.unadj                        double    0
## wtint4yr.unadj.norm                   double    0
## wtmec4yr.unadj                        double    0
## wtmec4yr.unadj.norm                   double    0
## wtint2yr.adj                          double    0
## wtint2yr.adj.norm                     double    0
## wtmec2yr.adj                          double    0
## wtmec2yr.adj.norm                     double    0
## wtint4yr.adj                          double    0
## wtint4yr.adj.norm                     double    0
## wtmec4yr.adj                          double    0
## wtmec4yr.adj.norm                     double    0
## tac                        numeric    double    0
## tlac                       numeric    double    0
## wt                         numeric    double    0
## st                                    double    0
## mvpa                       numeric    double    0
## about                                 double    0
## sbout                                 double    0
## satp                       numeric    double    0
## astp                       numeric    double    0
## tlac.1                                double    0
## tlac.2                                double    0
## tlac.3                                double    0
## tlac.4                                double    0
## tlac.5                                double    0
## tlac.6                                double    0
## tlac.7                                double    0
## tlac.8                                double    0
## tlac.9                                double    0
## tlac.10                               double    0
## tlac.11                               double    0
## tlac.12                               double    0
## 
## +---------------+----------------------------------------------------------+
## |Variable       |Levels                                                    |
## +---------------+----------------------------------------------------------+
## |bmi.cat        |Normal,Underweight,Overweight,Obese                       |
## +---------------+----------------------------------------------------------+
## |race           |White,Mexican American,Other Hispanic,Black,Other         |
## +---------------+----------------------------------------------------------+
## |gender         |Male,Female                                               |
## +---------------+----------------------------------------------------------+
## |diabetes       |No,Yes                                                    |
## |chf            |                                                          |
## |chd            |                                                          |
## |cancer         |                                                          |
## |stroke         |                                                          |
## +---------------+----------------------------------------------------------+
## |educationadult |Less than high school,High school,More than high school   |
## +---------------+----------------------------------------------------------+
## |mobilityproblem|No Difficulty,Any Difficulty                              |
## +---------------+----------------------------------------------------------+
## |drinkstatus    |Moderate Drinker,Non-Drinker,Heavy Drinker,Missing alcohol|
## +---------------+----------------------------------------------------------+
## |smokecigs      |Never,Former,Current                                      |
## +---------------+----------------------------------------------------------+
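The labelling step that produced the output above is not shown. A minimal base R sketch, using a small stand-in data frame, attaches each label as a `"label"` attribute, which is where the `str()` output in the data cleaning section shows them. The object names `nhanes_toy` and `pa_labels` are hypothetical. With Hmisc loaded, `upData()` with its `labels` argument is one way to do the same on the full data frame and prints the object sizes seen above.

```r
# Stand-in data frame with two activity variables (first values from this document)
nhanes_toy <- data.frame(tac = c(409353, 286408), mvpa = c(48.29, 9.43))

# Labels as listed in the "Physical activity data" section
pa_labels <- c(
  tac  = "total activity counts per day",
  mvpa = "total minutes of moderate/vigorous physical activity"
)

# Attach each label as a "label" attribute of the corresponding column
for (v in names(pa_labels)) {
  attr(nhanes_toy[[v]], "label") <- pa_labels[[v]]
}
attr(nhanes_toy$tac, "label")
```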


5 Data cleaning

Note that the data we use here have already been pre-cleaned.

The pre-cleaning involved the following steps.

Exclusions: From 14631 exclude participants who were

It is useful to create a flow chart to summarize inclusion/exclusion criteria. Here is a different example from another study that combined data from surveys for two different populations:

Figure: STROBE flow diagram

There are similar flow diagrams for other types of study, e.g. randomized controlled trials (CONSORT) or meta-analyses (PRISMA). These can be found on the website EQUATOR Network.

Recoding

Set missing ‘drinkstatus’ to NA
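A sketch of this recoding on a few hypothetical values, using the four levels listed in the metadata section:

```r
# Hypothetical values using the four levels listed in the metadata
drinkstatus <- factor(
  c("Non-Drinker", "Heavy Drinker", "Missing alcohol", "Moderate Drinker"),
  levels = c("Moderate Drinker", "Non-Drinker", "Heavy Drinker", "Missing alcohol")
)

# Recode the placeholder category to a true missing value
drinkstatus[drinkstatus == "Missing alcohol"] <- NA

# The level stays defined but is now empty; droplevels() would remove it
table(drinkstatus, useNA = "ifany")
```

Keeping the empty level explains why the summaries in this document still list "Missing alcohol" with a count of 0.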

Check data types, labels, and first few values of each of the variables needed for the analysis:

## 'data.frame':    2978 obs. of  30 variables:
##  $ yr5.mort       : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ permth.exm     : int  135 149 127 24 153 154 152 150 112 151 ...
##  $ bmi            : num  31.3 25.5 19.6 28.3 38 ...
##  $ bmi.cat        : Factor w/ 4 levels "Normal","Underweight",..: 4 3 1 3 4 1 4 4 3 4 ...
##  $ race           : Factor w/ 5 levels "White","Mexican American",..: 1 1 4 1 2 4 4 1 1 1 ...
##  $ gender         : Factor w/ 2 levels "Male","Female": 1 2 1 1 2 2 2 2 1 1 ...
##  $ diabetes       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ chf            : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 2 1 ...
##  $ chd            : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 1 1 2 1 ...
##  $ cancer         : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 1 1 1 ...
##  $ stroke         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ educationadult : Factor w/ 3 levels "Less than high school",..: 2 3 2 3 1 1 2 2 2 2 ...
##  $ mobilityproblem: Factor w/ 2 levels "No Difficulty",..: 1 1 2 2 1 1 1 2 2 1 ...
##  $ drinkstatus    : Factor w/ 4 levels "Moderate Drinker",..: 2 3 NA 2 2 2 2 2 2 2 ...
##  $ smokecigs      : Factor w/ 3 levels "Never","Former",..: 1 3 3 2 2 1 1 2 2 2 ...
##  $ age            : num  56 52.8 63.8 83.9 50.6 ...
##  $ sys            : int  120 133 123 154 115 131 152 116 119 127 ...
##  $ lbxtc          : int  254 174 191 141 173 230 195 156 152 177 ...
##  $ lbdhdd         : int  37 119 92 34 45 51 53 47 30 59 ...
##  $ tac            : 'labelled' num  409353 286408 130778 102563 511161 ...
##   ..- attr(*, "label")= chr "total activity counts per day"
##  $ tlac           : 'labelled' num  3522 3335 2749 2104 4390 ...
##   ..- attr(*, "label")= chr "total log activity count"
##  $ wt             : 'labelled' num  900 783 1053 813 1014 ...
##   ..- attr(*, "label")= chr "total accelerometer wear time"
##  $ mvpa           : 'labelled' num  48.29 9.43 4.71 3 40.5 ...
##   ..- attr(*, "label")= chr "total minutes of moderate/vigorous physical activity"
##  $ satp           : 'labelled' num  0.0982 0.0885 0.0909 0.0724 0.1229 ...
##   ..- attr(*, "label")= chr "sedentary/sleep/non-wear to active transition probability"
##  $ astp           : 'labelled' num  0.232 0.215 0.405 0.335 0.177 ...
##   ..- attr(*, "label")= chr "active to sedentary/sleep/non-wear transition probability"
##  $ bpxsy1         : int  124 128 126 154 118 130 164 124 124 132 ...
##  $ bpxsy2         : int  124 138 120 NA 110 NA 144 114 116 126 ...
##  $ bpxsy3         : int  112 134 122 NA 116 132 148 110 118 124 ...
##  $ bpxsy4         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ drinksperweek  : num  0 28 NA 0 0 0 0 0 0 0 ...

A quick summary of the variables can be obtained with summary(nhanesdat[,vars]), or a more elaborate one with the package summarytools.

The mean of the total log activity counts is 2758.53.


Data Frame Summary

Dimensions: 2978 x 28
Duplicates: 0
No | Variable | Label | Stats / Values | Freqs (% of Valid) | Valid | Missing
1 | bmi [numeric] | | mean (sd): 28.8 (5.8); min < med < max: 13.36 < 28.01 < 57.41; IQR (CV): 7.1 (0.2) | 1551 distinct values | 2978 (100%) | 0 (0%)
2 | bmi.cat [factor] | | 1. Normal; 2. Underweight; 3. Overweight; 4. Obese | 760 (25.5%); 29 (1.0%); 1150 (38.6%); 1039 (34.9%) | 2978 (100%) | 0 (0%)
3 | race [factor] | | 1. White; 2. Mexican American; 3. Other Hispanic; 4. Black; 5. Other | 1756 (59.0%); 538 (18.1%); 56 (1.9%); 534 (17.9%); 94 (3.2%) | 2978 (100%) | 0 (0%)
4 | gender [factor] | | 1. Male; 2. Female | 1523 (51.1%); 1455 (48.9%) | 2978 (100%) | 0 (0%)
5 | diabetes [factor] | | 1. No; 2. Yes | 2460 (82.6%); 518 (17.4%) | 2978 (100%) | 0 (0%)
6 | chf [factor] | | 1. No; 2. Yes | 2810 (94.4%); 168 (5.6%) | 2978 (100%) | 0 (0%)
7 | chd [factor] | | 1. No; 2. Yes | 2734 (91.8%); 244 (8.2%) | 2978 (100%) | 0 (0%)
8 | cancer [factor] | | 1. No; 2. Yes | 2523 (84.7%); 455 (15.3%) | 2978 (100%) | 0 (0%)
9 | stroke [factor] | | 1. No; 2. Yes | 2804 (94.2%); 174 (5.8%) | 2978 (100%) | 0 (0%)
10 | educationadult [factor] | | 1. Less than high school; 2. High school; 3. More than high school | 945 (31.7%); 739 (24.8%); 1294 (43.5%) | 2978 (100%) | 0 (0%)
11 | mobilityproblem [factor] | | 1. No Difficulty; 2. Any Difficulty | 2038 (68.4%); 940 (31.6%) | 2978 (100%) | 0 (0%)
12 | drinkstatus [factor] | | 1. Moderate Drinker; 2. Non-Drinker; 3. Heavy Drinker; 4. Missing alcohol | 1445 (50.0%); 1266 (43.8%); 181 (6.3%); 0 (0.0%) | 2892 (97.11%) | 86 (2.89%)
13 | smokecigs [factor] | | 1. Never; 2. Former; 3. Current | 1326 (44.5%); 1147 (38.5%); 505 (17.0%) | 2978 (100%) | 0 (0%)
14 | age [numeric] | | mean (sd): 65.85 (9.62); min < med < max: 50 < 65.42 < 84.92; IQR (CV): 15.25 (0.15) | 417 distinct values | 2978 (100%) | 0 (0%)
15 | sys [integer] | | mean (sd): 133.56 (21.36); min < med < max: 73 < 131 < 270; IQR (CV): 26 (0.16) | 131 distinct values | 2978 (100%) | 0 (0%)
16 | lbxtc [integer] | | mean (sd): 205.03 (42.3); min < med < max: 82 < 203 < 458; IQR (CV): 55 (0.21) | 239 distinct values | 2978 (100%) | 0 (0%)
17 | lbdhdd [integer] | | mean (sd): 55.31 (16.36); min < med < max: 17 < 53 < 188; IQR (CV): 21 (0.3) | 98 distinct values | 2978 (100%) | 0 (0%)
18 | tac [labelled, numeric] | total activity counts per day | mean (sd): 209786.13 (112910.97); min < med < max: 8931.83 < 193949.91 < 912075.6; IQR (CV): 145388.35 (0.54) | 2975 distinct values | 2978 (100%) | 0 (0%)
19 | tlac [labelled, numeric] | total log activity count | mean (sd): 2758.53 (727.2); min < med < max: 429.93 < 2754.01 < 5655.47; IQR (CV): 978.6 (0.26) | 2976 distinct values | 2978 (100%) | 0 (0%)
20 | wt [labelled, numeric] | total accelerometer wear time | mean (sd): 878.53 (138.51); min < med < max: 615.33 < 860.69 < 1440; IQR (CV): 137.8 (0.16) | 2215 distinct values | 2978 (100%) | 0 (0%)
21 | mvpa [labelled, numeric] | total minutes of moderate/vigorous physical activity | mean (sd): 13.85 (17.05); min < med < max: 0 < 7.43 < 152.33; IQR (CV): 16.79 (1.23) | 734 distinct values | 2978 (100%) | 0 (0%)
22 | satp [labelled, numeric] | sedentary/sleep/non-wear to active transition probability | mean (sd): 0.08 (0.02); min < med < max: 0.01 < 0.08 < 0.2; IQR (CV): 0.03 (0.27) | 2976 distinct values | 2978 (100%) | 0 (0%)
23 | astp [labelled, numeric] | active to sedentary/sleep/non-wear transition probability | mean (sd): 0.3 (0.09); min < med < max: 0.05 < 0.29 < 0.74; IQR (CV): 0.11 (0.3) | 2976 distinct values | 2978 (100%) | 0 (0%)
24 | bpxsy1 [integer] | | mean (sd): 134.92 (22.08); min < med < max: 74 < 132 < 270; IQR (CV): 26 (0.16) | 73 distinct values | 2631 (88.35%) | 347 (11.65%)
25 | bpxsy2 [integer] | | mean (sd): 132.93 (21.05); min < med < max: 72 < 132 < 226; IQR (CV): 26 (0.16) | 71 distinct values | 2492 (83.68%) | 486 (16.32%)
26 | bpxsy3 [integer] | | mean (sd): 131.43 (20.76); min < med < max: 80 < 130 < 222; IQR (CV): 26 (0.16) | 69 distinct values | 2455 (82.44%) | 523 (17.56%)
27 | bpxsy4 [integer] | | mean (sd): 132.89 (20.02); min < med < max: 80 < 130 < 204; IQR (CV): 24 (0.15) | 55 distinct values | 577 (19.38%) | 2401 (80.62%)
28 | drinksperweek [numeric] | | mean (sd): 2.66 (6.7); min < med < max: 0 < 0.06 < 112; IQR (CV): 2 (2.52) | 99 distinct values | 2892 (97.11%) | 86 (2.89%)

Generated by summarytools 0.8.9 (R version 3.6.1)
2020-04-11

The outcome variable is 5 year mortality.

Data Frame Summary

Dimensions: 2978 x 2
Duplicates: 2821
No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing
1 | yr5.mort [integer] | mean (sd): 0.1 (0.3); min < med < max: 0 < 0 < 1; mode: 0 | 0: 2681 (90.0%); 1: 297 (10.0%) | 2978 (100%) | 0 (0%)
2 | permth.exm [integer] | mean (sd): 116.33 (33.52); min < med < max: 1 < 126 < 157; IQR (CV): 27 (0.29) | 157 distinct values | 2978 (100%) | 0 (0%)

Generated by summarytools 0.8.9 (R version 3.6.1)
2020-04-11

5.0.1 Univariate distributions for continuous variables

If there is a marathon runner, what would the TAC be?

Unrealistically high or low values may be detected with histograms or by calculating minimum and maximum values. If a variable ‘Chemo’ is expected to take only the values 0 and 1 (not treated or treated with chemotherapy) and we instead see values such as 4, 2, or 7, the column labels may have been swapped. A year of birth of 2078 is likely meant to be 1978. If a value can be checked or the correction is obvious, it can be corrected; otherwise it may be necessary to set the value to missing. It is also good practice to check dates and lengths of time, for example by subtracting the date of study entry or the date of birth from the follow-up date to make sure there are no negative time differences.
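These checks can be sketched on a few made-up records. The variable names, the dates, and the plausibility limits for the year of birth are all hypothetical.

```r
# Hypothetical records illustrating the checks described above
dat <- data.frame(
  id         = 1:4,
  chemo      = c(0, 1, 4, 1),              # expected to be 0 or 1
  birth_year = c(1950, 2078, 1962, 1978),  # 2078 is implausible
  entry_date = as.Date(c("2004-03-01", "2004-05-10", "2005-01-15", "2005-06-30")),
  fu_date    = as.Date(c("2009-03-01", "2003-05-10", "2010-01-15", "2011-06-30"))
)

# Range checks: flag values outside the plausible set or interval
bad_chemo <- !(dat$chemo %in% c(0, 1))
bad_birth <- dat$birth_year < 1900 | dat$birth_year > 2006  # limits are assumptions

# Date check: follow-up must not precede study entry
fu_days <- as.numeric(dat$fu_date - dat$entry_date)
bad_fu  <- fu_days < 0

# IDs of records that need to be looked at
dat$id[bad_chemo | bad_birth | bad_fu]
```

Flagging records rather than overwriting them keeps the decision about correcting or setting to missing separate from the detection step.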

5.0.3 Interactive graphics

5.0.4 Pairwise associations

Visualizing a correlation matrix between physical activity variables and age

##             age        tac       tlac          wt       mvpa       satp
## age   1.0000000 -0.4316314 -0.3842809 -0.01991550 -0.3259843 -0.3331776
## tac  -0.4316314  1.0000000  0.8127898  0.17090389  0.8414233  0.4776727
## tlac -0.3842809  0.8127898  1.0000000  0.34284265  0.4773299  0.8103376
## wt   -0.0199155  0.1709039  0.3428426  1.00000000  0.1042871  0.4643814
## mvpa -0.3259843  0.8414233  0.4773299  0.10428711  1.0000000  0.2114295
## satp -0.3331776  0.4776727  0.8103376  0.46438139  0.2114295  1.0000000
## astp  0.2976469 -0.7547711 -0.7212789  0.05170059 -0.4553590 -0.3197902
##             astp
## age   0.29764689
## tac  -0.75477108
## tlac -0.72127891
## wt    0.05170059
## mvpa -0.45535905
## satp -0.31979021
## astp  1.00000000

The highest correlations are between the pairs (mvpa, tac), (tac, tlac), and (satp, tlac), each above 0.8.
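Reading off the strongest pairs by eye is error-prone for larger matrices. The sketch below ranks the pairs by absolute correlation, using the values from the output above rounded to three decimals; the object names are illustrative.

```r
vars <- c("age", "tac", "tlac", "wt", "mvpa", "satp", "astp")
# Correlations copied from the output above, rounded to three decimals
r <- matrix(c(
   1.000, -0.432, -0.384, -0.020, -0.326, -0.333,  0.298,
  -0.432,  1.000,  0.813,  0.171,  0.841,  0.478, -0.755,
  -0.384,  0.813,  1.000,  0.343,  0.477,  0.810, -0.721,
  -0.020,  0.171,  0.343,  1.000,  0.104,  0.464,  0.052,
  -0.326,  0.841,  0.477,  0.104,  1.000,  0.211, -0.455,
  -0.333,  0.478,  0.810,  0.464,  0.211,  1.000, -0.320,
   0.298, -0.755, -0.721,  0.052, -0.455, -0.320,  1.000
), nrow = 7, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(r), arr.ind = TRUE)
pair_tab <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = r[idx]
)
pair_tab <- pair_tab[order(-abs(pair_tab$r)), ]
head(pair_tab, 5)
```

Sorting by absolute value also surfaces the strong negative correlations of astp with tac and tlac, which rank directly after the three pairs named above.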

Associations can be summarized in a scatter plot with “trend lines.”

Visualization of other types of associations:

Unrealistic combinations of values can be detected by scatter plots or cross tables.


6 Data screening

Some of the quantitative or graphical summaries are identical to the ones in the data cleaning step. However, now the focus is on data properties and potential impact on the intended statistical analysis plan.


Review data summaries from the previous step with this new purpose in mind.

Examples of what data screening might reveal:

  1. If we observe an uneven distribution of age and most of the survey respondents turn out to be around 50-60 years old, this would reveal that older participants are not well represented.

  2. Health behaviors may differ between different age groups, thus there would be a question of selection bias and generalizability.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    80.0   120.0   130.0   132.9   144.0   204.0    2401
## 
##     FALSE      TRUE 
## 0.1937542 0.8062458

Here we see that about 81% of the values for the fourth systolic blood pressure measurement (bpxsy4) are missing.
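Output of this kind can be produced with a few base R calls. The sketch uses a stand-in vector with the same numbers of observed and missing values as reported for `bpxsy4`; the observed value itself is a placeholder.

```r
# Stand-in for bpxsy4: 577 observed and 2401 missing values, as reported above
# (the observed value 130 is a placeholder)
bpxsy4 <- c(rep(130, 577), rep(NA, 2401))

summary(bpxsy4)                   # includes the NA count
prop.table(table(is.na(bpxsy4)))  # FALSE = observed, TRUE = missing
mean(is.na(bpxsy4))               # proportion missing, about 0.806
```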

6.0.1 Missing data patterns

There does not seem to be a pattern of missingness.
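One simple way to look for such patterns without graphics is to tabulate which combinations of variables are missing together. The data below are hypothetical; packages such as VIM, loaded at the top of this document, offer graphical versions of the same idea.

```r
# Hypothetical data with missing values in three variables
toy <- data.frame(
  bpxsy2        = c(124, NA, 120, NA, 110, 126),
  bpxsy3        = c(112, NA, 122, 118, 116, NA),
  drinksperweek = c(0, 28, NA, 0, 0, 0)
)

# One string per row: 1 = missing, 0 = observed, in column order
pattern <- apply(is.na(toy), 1, function(m) paste(as.integer(m), collapse = ""))

# Frequency of each missingness pattern, most common first
sort(table(pattern), decreasing = TRUE)

# Do two variables tend to be missing together?
table(bpxsy2_missing = is.na(toy$bpxsy2), bpxsy3_missing = is.na(toy$bpxsy3))
```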

There may be situations where answers to questions are missing depending on education level, which would point to a systematic pattern.

We can check differences in variables stratified by education level, if this could have an impact on the intended analysis plan.
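As a small worked example, the share of missing drinkstatus values per education level can be computed from the N-Miss row and the column totals of the table of participant characteristics by education level in this section:

```r
# Counts taken from the drinkstatus "N-Miss" row and the column totals
n_total <- c("Less than high school" = 945, "High school" = 739,
             "More than high school" = 1294)
n_miss  <- c(37, 24, 25)

# Percentage of participants with missing drinkstatus per education level
pct_miss <- round(100 * n_miss / n_total, 1)
pct_miss
```

The percentages come out at roughly 3.9, 3.2, and 1.9, so missingness is somewhat more common in the lowest education group. Whether that difference matters for the intended analysis is a judgment the analyst has to make.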

Participant characteristics by education level

| Variable | Less than high school (N=945) | High school (N=739) | More than high school (N=1294) | Total (N=2978) |
|---|---|---|---|---|
| yr5.mort | | | | |
| Median | 0.000 | 0.000 | 0.000 | 0.000 |
| Q1, Q3 | 0.000, 0.000 | 0.000, 0.000 | 0.000, 0.000 | 0.000, 0.000 |
| Range | 0.000 - 1.000 | 0.000 - 1.000 | 0.000 - 1.000 | 0.000 - 1.000 |
| permth.exm | | | | |
| Median | 122.000 | 125.000 | 127.000 | 126.000 |
| Q1, Q3 | 101.000, 137.000 | 111.000, 137.000 | 115.000, 139.000 | 111.000, 138.000 |
| Range | 2.000 - 156.000 | 1.000 - 157.000 | 2.000 - 157.000 | 1.000 - 157.000 |
| bmi | | | | |
| Median | 27.800 | 28.400 | 27.885 | 28.005 |
| Q1, Q3 | 24.850, 31.980 | 24.895, 32.225 | 24.678, 31.758 | 24.793, 31.890 |
| Range | 15.920 - 53.540 | 14.700 - 50.750 | 13.360 - 57.410 | 13.360 - 57.410 |
| bmi.cat | | | | |
| Normal | 239 (25.3%) | 182 (24.6%) | 339 (26.2%) | 760 (25.5%) |
| Underweight | 7 (0.7%) | 8 (1.1%) | 14 (1.1%) | 29 (1.0%) |
| Overweight | 366 (38.7%) | 270 (36.5%) | 514 (39.7%) | 1150 (38.6%) |
| Obese | 333 (35.2%) | 279 (37.8%) | 427 (33.0%) | 1039 (34.9%) |
| race | | | | |
| White | 335 (35.4%) | 509 (68.9%) | 912 (70.5%) | 1756 (59.0%) |
| Mexican American | 366 (38.7%) | 81 (11.0%) | 91 (7.0%) | 538 (18.1%) |
| Other Hispanic | 30 (3.2%) | 7 (0.9%) | 19 (1.5%) | 56 (1.9%) |
| Black | 195 (20.6%) | 121 (16.4%) | 218 (16.8%) | 534 (17.9%) |
| Other | 19 (2.0%) | 21 (2.8%) | 54 (4.2%) | 94 (3.2%) |
| gender | | | | |
| Male | 495 (52.4%) | 350 (47.4%) | 678 (52.4%) | 1523 (51.1%) |
| Female | 450 (47.6%) | 389 (52.6%) | 616 (47.6%) | 1455 (48.9%) |
| diabetes | | | | |
| No | 733 (77.6%) | 609 (82.4%) | 1118 (86.4%) | 2460 (82.6%) |
| Yes | 212 (22.4%) | 130 (17.6%) | 176 (13.6%) | 518 (17.4%) |
| chf | | | | |
| No | 874 (92.5%) | 697 (94.3%) | 1239 (95.7%) | 2810 (94.4%) |
| Yes | 71 (7.5%) | 42 (5.7%) | 55 (4.3%) | 168 (5.6%) |
| chd | | | | |
| No | 863 (91.3%) | 672 (90.9%) | 1199 (92.7%) | 2734 (91.8%) |
| Yes | 82 (8.7%) | 67 (9.1%) | 95 (7.3%) | 244 (8.2%) |
| cancer | | | | |
| No | 831 (87.9%) | 623 (84.3%) | 1069 (82.6%) | 2523 (84.7%) |
| Yes | 114 (12.1%) | 116 (15.7%) | 225 (17.4%) | 455 (15.3%) |
| stroke | | | | |
| No | 872 (92.3%) | 696 (94.2%) | 1236 (95.5%) | 2804 (94.2%) |
| Yes | 73 (7.7%) | 43 (5.8%) | 58 (4.5%) | 174 (5.8%) |
| mobilityproblem | | | | |
| No Difficulty | 556 (58.8%) | 480 (65.0%) | 1002 (77.4%) | 2038 (68.4%) |
| Any Difficulty | 389 (41.2%) | 259 (35.0%) | 292 (22.6%) | 940 (31.6%) |
| drinkstatus | | | | |
| N-Miss | 37 | 24 | 25 | 86 |
| Moderate Drinker | 353 (38.9%) | 354 (49.5%) | 738 (58.2%) | 1445 (50.0%) |
| Non-Drinker | 507 (55.8%) | 316 (44.2%) | 443 (34.9%) | 1266 (43.8%) |
| Heavy Drinker | 48 (5.3%) | 45 (6.3%) | 88 (6.9%) | 181 (6.3%) |
| Missing alcohol | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) |
| smokecigs | | | | |
| Never | 398 (42.1%) | 333 (45.1%) | 595 (46.0%) | 1326 (44.5%) |
| Former | 347 (36.7%) | 275 (37.2%) | 525 (40.6%) | 1147 (38.5%) |
| Current | 200 (21.2%) | 131 (17.7%) | 174 (13.4%) | 505 (17.0%) |
| age | | | | |
| Median | 68.167 | 66.250 | 62.667 | 65.417 |
| Q1, Q3 | 61.250, 75.667 | 59.083, 73.500 | 55.667, 70.917 | 58.083, 73.333 |
| Range | 50.000 - 84.917 | 50.000 - 84.833 | 50.000 - 84.917 | 50.000 - 84.917 |
| sys | | | | |
| Median | 134.000 | 132.000 | 129.000 | 131.000 |
| Q1, Q3 | 122.000, 149.000 | 119.000, 145.000 | 117.000, 142.000 | 119.000, 145.000 |
| Range | 80.000 - 270.000 | 73.000 - 217.000 | 85.000 - 215.000 | 73.000 - 270.000 |
| lbxtc | | | | |
| Median | 204.000 | 199.000 | 205.000 | 203.000 |
| Q1, Q3 | 172.000, 231.000 | 174.000, 228.000 | 178.000, 229.750 | 175.000, 230.000 |
| Range | 102.000 - 458.000 | 82.000 - 427.000 | 98.000 - 367.000 | 82.000 - 458.000 |
| lbdhdd | | | | |
| Median | 52.000 | 52.000 | 54.000 | 53.000 |
| Q1, Q3 | 42.000, 62.000 | 43.000, 64.000 | 44.000, 67.000 | 43.000, 64.000 |
| Range | 23.000 - 188.000 | 24.000 - 122.000 | 17.000 - 154.000 | 17.000 - 188.000 |
| total activity counts per day | | | | |
| Median | 177720.857 | 186759.714 | 206617.929 | 193949.914 |
| Q1, Q3 | 113655.429, 272434.000 | 123946.871, 253500.643 | 138213.613, 278977.357 | 126602.250, 271990.600 |
| Range | 26157.714 - 912075.600 | 20141.333 - 825133.333 | 8931.833 - 784031.429 | 8931.833 - 912075.600 |
| total log activity count | | | | |
| Median | 2746.781 | 2721.324 | 2770.573 | 2754.012 |
| Q1, Q3 | 2191.198, 3272.205 | 2257.260, 3215.428 | 2316.710, 3246.734 | 2262.797, 3241.398 |
| Range | 661.230 - 5655.468 | 779.557 - 5322.091 | 429.929 - 5078.713 | 429.929 - 5655.468 |
| total accelerometer wear time | | | | |
| Median | 837.143 | 857.857 | 875.310 | 860.690 |
| Q1, Q3 | 771.333, 915.714 | 792.357, 929.143 | 812.607, 937.125 | 793.232, 931.036 |
| Range | 615.333 - 1440.000 | 627.000 - 1440.000 | 615.333 - 1440.000 | 615.333 - 1440.000 |
| total minutes of moderate/vigorous physical activity | | | | |
| Median | 5.667 | 5.800 | 9.633 | 7.429 |
| Q1, Q3 | 2.000, 16.286 | 2.071, 16.143 | 3.333, 22.714 | 2.500, 19.286 |
| Range | 0.000 - 136.600 | 0.000 - 152.333 | 0.000 - 122.286 | 0.000 - 152.333 |
| sedentary/sleep/non-wear to active transition probability | | | | |
| Median | 0.080 | 0.081 | 0.082 | 0.081 |
| Q1, Q3 | 0.067, 0.095 | 0.066, 0.096 | 0.068, 0.097 | 0.067, 0.096 |
| Range | 0.018 - 0.199 | 0.023 - 0.169 | 0.008 - 0.197 | 0.008 - 0.199 |
| active to sedentary/sleep/non-wear transition probability | | | | |
| Median | 0.287 | 0.292 | 0.288 | 0.289 |
| Q1, Q3 | 0.231, 0.353 | 0.240, 0.346 | 0.245, 0.339 | 0.240, 0.345 |
| Range | 0.073 - 0.690 | 0.094 - 0.745 | 0.052 - 0.728 | 0.052 - 0.745 |
| bpxsy1 | | | | |
| Median | 136.000 | 134.000 | 130.000 | 132.000 |
| Q1, Q3 | 124.000, 152.000 | 120.000, 146.000 | 118.000, 143.000 | 120.000, 146.000 |
| Range | 80.000 - 270.000 | 74.000 - 212.000 | 86.000 - 228.000 | 74.000 - 270.000 |
| bpxsy2 | | | | |
| Median | 134.000 | 132.000 | 128.000 | 132.000 |
| Q1, Q3 | 122.000, 150.000 | 118.000, 144.000 | 116.000, 140.000 | 118.000, 144.000 |
| Range | 80.000 - 226.000 | 72.000 - 218.000 | 82.000 - 204.000 | 72.000 - 226.000 |
| bpxsy3 | | | | |
| Median | 132.000 | 130.000 | 126.000 | 130.000 |
| Q1, Q3 | 120.000, 146.000 | 117.500, 144.000 | 116.000, 140.000 | 118.000, 144.000 |
| Range | 88.000 - 222.000 | 80.000 - 216.000 | 84.000 - 196.000 | 80.000 - 222.000 |
| bpxsy4 | | | | |
| Median | 132.000 | 131.000 | 130.000 | 130.000 |
| Q1, Q3 | 120.000, 144.000 | 118.500, 144.000 | 118.000, 144.000 | 120.000, 144.000 |
| Range | 86.000 - 204.000 | 80.000 - 188.000 | 90.000 - 202.000 | 80.000 - 204.000 |
| drinksperweek | | | | |
| Median | 0.000 | 0.038 | 0.230 | 0.058 |
| Q1, Q3 | 0.000, 0.467 | 0.000, 1.630 | 0.000, 3.000 | 0.000, 2.000 |
| Range | 0.000 - 63.000 | 0.000 - 112.000 | 0.000 - 70.000 | 0.000 - 112.000 |
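The statistics in a stratified table like this (median, quartiles, and range for continuous variables; counts and column percentages for categorical ones) can be sketched in base R as follows. This is an illustration on simulated data and is not necessarily how the table above was generated; the values are invented, and the quantile method may differ from the one used for the table.

```r
# Simulated toy data; names follow the tutorial, values are invented.
set.seed(3)
n <- 300
toy <- data.frame(
  educationadult = factor(sample(c("Less than high school", "High school",
                                   "More than high school"), n, replace = TRUE)),
  age = runif(n, min = 50, max = 85),
  diabetes = factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)

# Continuous variable: median, quartiles, and range within each education level.
age_by_edu <- sapply(split(toy$age, toy$educationadult), function(x) {
  c(Median = median(x),
    Q1 = quantile(x, 0.25, names = FALSE),
    Q3 = quantile(x, 0.75, names = FALSE),
    Min = min(x), Max = max(x))
})
round(age_by_edu, 3)

# Categorical variable: counts and column percentages within each education level.
diab_n <- table(toy$diabetes, toy$educationadult)
diab_pct <- round(100 * prop.table(diab_n, margin = 2), 1)
diab_n
diab_pct
```

Using `margin = 2` gives percentages within each education level, which is the comparison of interest when looking for differences between strata.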

Or check a specific item only, such as total log activity count (tlac):

## # A tibble: 3 x 2
##   educationadult        `mean(tlac, na.rm = TRUE)`
##   <fct>                                      <dbl>
## 1 Less than high school                      2755.
## 2 High school                                2735.
## 3 More than high school                      2774.
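A group-wise mean like the one above can be computed in a single call. The sketch below uses base R on simulated data (the values are invented); the commented line shows what appears to be the equivalent dplyr call, judging from the column name in the output.

```r
# Simulated toy data; tlac values are invented for illustration.
set.seed(4)
toy <- data.frame(
  educationadult = factor(rep(c("Less than high school", "High school",
                                "More than high school"), each = 50)),
  tlac = rnorm(150, mean = 2750, sd = 700)
)
toy$tlac[c(1, 60, 120)] <- NA   # a few missing values, one per group

# Base R: mean of tlac per education level, ignoring missing values.
tlac_means <- tapply(toy$tlac, toy$educationadult, mean, na.rm = TRUE)
tlac_means

# dplyr equivalent, assuming dplyr is attached:
# toy %>% group_by(educationadult) %>% summarise(mean(tlac, na.rm = TRUE))
```

Without `na.rm = TRUE`, any group containing a missing value would return NA for its mean.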

7 Initial data reporting

An initial data report could be, for example, an R Markdown file that documents all insights obtained from the previous steps to inform collaborators.

7.0.1 Creating tables

Here are two R packages with nice output for tables, kableExtra and pixiedust:

| age | gender | race | educationadult | astp | satp |
|---|---|---|---|---|---|
| 56.00000 | Male | White | High school | 0.2315452 | 0.09819184 |
| 52.83333 | Female | White | More than high school | 0.2147554 | 0.08849793 |
| 63.83333 | Male | Black | High school | 0.4045712 | 0.09094480 |
| 83.91667 | Male | White | More than high school | 0.3353814 | 0.07236747 |
| 50.58333 | Female | Mexican American | Less than high school | 0.1767232 | 0.12290906 |
| 55.58333 | Female | Black | Less than high school | 0.3031173 | 0.09761710 |
| 57.33333 | Female | Black | High school | 0.2240179 | 0.08643943 |
| 84.25000 | Female | White | High school | 0.4687016 | 0.07410545 |
| 69.00000 | Male | White | High school | 0.3659029 | 0.05526371 |
| 55.91667 | Male | White | High school | 0.2614186 | 0.11077929 |

Note: These are the first 10 participants.

Table 1: Association with Total activity count

| Term | Coefficient | SE | T-statistic | P-value |
|---|---|---|---|---|
| (Intercept) | 95942.577 | 8809.055 | 10.891 | 0 |
| genderFemale | -6752.666 | 1868.96 | -3.613 | 0.0003076 |
| age | -1087.987 | 103.912 | -10.47 | 0 |
| mvpa | 4923.248 | 57.987 | 84.903 | 0 |
| satp | 1463022.973 | 43067.328 | 33.971 | 0 |
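The numbers in a coefficient table like Table 1 come from a fitted regression model. The following is a minimal base-R sketch on simulated data. It assumes a linear model of total activity count on gender, age, mvpa, and satp (inferred from the terms in the table), the outcome name `tac` is hypothetical, and the data-generating values are invented.

```r
# Simulated toy data; variable names follow the table terms, `tac` is a hypothetical name.
set.seed(5)
n <- 500
toy <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  age  = runif(n, 50, 85),
  mvpa = rexp(n, rate = 1 / 10),
  satp = runif(n, 0.02, 0.2)
)
# Invented data-generating model, used only so the example has something to fit.
toy$tac <- 1e5 - 5000 * (toy$gender == "Female") - 1000 * toy$age +
  5000 * toy$mvpa + 1.5e6 * toy$satp + rnorm(n, sd = 3e4)

fit <- lm(tac ~ gender + age + mvpa + satp, data = toy)

# coef(summary(fit)) is a matrix with estimate, SE, t value, and p-value per term.
coef_tab <- as.data.frame(coef(summary(fit)))
names(coef_tab) <- c("Coefficient", "SE", "T-statistic", "P-value")
coef_tab <- cbind(Term = rownames(coef_tab), coef_tab)
rownames(coef_tab) <- NULL

# Round for display; format.pval() avoids printing very small p-values as 0.
coef_tab[, 2:4] <- round(coef_tab[, 2:4], 3)
coef_tab[["P-value"]] <- format.pval(coef_tab[["P-value"]], digits = 3, eps = 0.001)
coef_tab
```

The resulting data frame can then be passed to a table-formatting package for the report.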


8 Refining/Updating the analysis plan

The research questions and statistical analysis plan are established before the initial data analysis. The data screening steps might reveal that the originally planned statistical model is not feasible.

There are a number of reasons for changing a statistical analysis plan.

## 
##      Normal Underweight  Overweight       Obese 
## 0.255204835 0.009738079 0.386165212 0.348891874
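The proportions above show one such reason: a category with very few observations (under 1% are Underweight). A sketch of how sparse levels might be flagged and merged is given below. The data are simulated, the 5% threshold is a hypothetical choice, and merging Underweight into Normal is only one possible remedy rather than the tutorial's stated approach.

```r
# Simulated BMI categories; probabilities are chosen only to mimic a sparse level.
set.seed(6)
bmi_cat <- factor(
  sample(c("Normal", "Underweight", "Overweight", "Obese"), 2000,
         replace = TRUE, prob = c(0.26, 0.01, 0.38, 0.35)),
  levels = c("Normal", "Underweight", "Overweight", "Obese")
)

# Proportion of participants in each category.
props <- prop.table(table(bmi_cat))
props

# Flag levels below a hypothetical 5% threshold as candidates for merging.
sparse_levels <- names(props)[props < 0.05]
sparse_levels

# One possible remedy (an illustration only): merge Underweight into Normal.
bmi_cat2 <- bmi_cat
levels(bmi_cat2)[levels(bmi_cat2) == "Underweight"] <- "Normal"
table(bmi_cat2)
```

Whether merging is appropriate depends on the subject matter; categories should only be combined when it makes sense for the research question.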

Example: Comorbidities can have too few events. If their effects on the outcome are similar, it may be possible to combine several into a single indicator.

## 
##   No  Yes 
## 2804  174
## 
##         No        Yes 
## 0.94157152 0.05842848
## 
##   No  Yes 
## 2669  309
## 
##        No       Yes 
## 0.8962391 0.1037609
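Combining two sparse comorbidity indicators into one can be sketched as follows. The source does not state which comorbidities were combined in the output above, so the pairing of stroke and chf here is an assumption for illustration, the combined variable name is hypothetical, and the data are simulated.

```r
# Simulated comorbidity indicators; prevalences are invented.
set.seed(7)
n <- 1000
toy <- data.frame(
  stroke = factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.94, 0.06))),
  chf    = factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.94, 0.06)))
)

# Counts and proportions for a single comorbidity.
table(toy$stroke)
prop.table(table(toy$stroke))

# Combined indicator (hypothetical name): "Yes" if either condition is present.
toy$stroke_or_chf <- factor(
  ifelse(toy$stroke == "Yes" | toy$chf == "Yes", "Yes", "No"),
  levels = c("No", "Yes")
)
table(toy$stroke_or_chf)
prop.table(table(toy$stroke_or_chf))
```

Participants with both conditions are counted once, so the combined count is at most the sum of the individual counts.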

Alternatives:

Sometimes we discover during the IDA process that the original research question cannot be answered. At other times we find that no further modeling is required and that an Initial Data Analysis (descriptive summary) is enough for the manuscript.

9 Reporting IDA findings in the final report/manuscript

10 Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] plotly_4.9.0       pixiedust_0.8.6    kableExtra_1.1.0   VIM_4.8.0         
##  [5] data.table_1.12.8  colorspace_1.4-1   GGally_1.4.0       gridExtra_2.3     
##  [9] arsenal_3.3.0      summarytools_0.8.9 Hmisc_4.2-0        Formula_1.2-3     
## [13] survival_2.44-1.1  lattice_0.20-38    tidyr_1.0.0        dplyr_0.8.5       
## [17] ggplot2_3.3.0      rmarkdown_1.15    
## 
## loaded via a namespace (and not attached):
##   [1] pryr_0.1.4          ellipsis_0.3.0      class_7.3-15       
##   [4] rio_0.5.16          htmlTable_1.13.2    base64enc_0.1-3    
##   [7] rstudioapi_0.11     farver_2.0.3        fansi_0.4.1        
##  [10] lubridate_1.7.4     ranger_0.11.2       xml2_1.2.2         
##  [13] codetools_0.2-16    splines_3.6.1       robustbase_0.93-5  
##  [16] knitr_1.28          jsonlite_1.6        broom_0.5.2        
##  [19] cluster_2.1.0       shiny_1.3.2         readr_1.3.1        
##  [22] compiler_3.6.1      httr_1.4.1          backports_1.1.5    
##  [25] assertthat_0.2.1    Matrix_1.2-17       lazyeval_0.2.2     
##  [28] cli_2.0.2           later_0.8.0         acepack_1.4.1      
##  [31] htmltools_0.3.6     tools_3.6.1         gtable_0.3.0       
##  [34] glue_1.3.2          Rcpp_1.0.4          carData_3.0-2      
##  [37] cellranger_1.1.0    vctrs_0.2.4         nlme_3.1-140       
##  [40] crosstalk_1.0.0     lmtest_0.9-37       xfun_0.12          
##  [43] laeken_0.5.0        stringr_1.4.0       openxlsx_4.1.0.1   
##  [46] rvest_0.3.4         mime_0.9            lifecycle_0.2.0    
##  [49] DEoptimR_1.0-8      MASS_7.3-51.4       zoo_1.8-6          
##  [52] scales_1.1.0        hms_0.5.3           promises_1.0.1     
##  [55] RColorBrewer_1.1-2  yaml_2.2.1          curl_4.3           
##  [58] pander_0.6.3        rpart_4.1-15        reshape_0.8.8      
##  [61] latticeExtra_0.6-28 stringi_1.4.6       highr_0.8          
##  [64] e1071_1.7-2         checkmate_1.9.4     boot_1.3-22        
##  [67] zip_2.0.4           rlang_0.4.5         pkgconfig_2.0.3    
##  [70] matrixStats_0.55.0  bitops_1.0-6        evaluate_0.14      
##  [73] purrr_0.3.3         rapportools_1.0     htmlwidgets_1.3    
##  [76] labeling_0.3        tidyselect_1.0.0    plyr_1.8.4         
##  [79] magrittr_1.5        R6_2.4.1            generics_0.0.2     
##  [82] pillar_1.4.3        haven_2.2.0         foreign_0.8-71     
##  [85] withr_2.1.2         mgcv_1.8-28         abind_1.4-5        
##  [88] RCurl_1.95-4.12     sp_1.3-1            nnet_7.3-12        
##  [91] tibble_2.1.3        crayon_1.3.4        car_3.0-3          
##  [94] utf8_1.1.4          readxl_1.3.1        forcats_0.5.0      
##  [97] vcd_1.4-4           digest_0.6.25       webshot_0.5.1      
## [100] xtable_1.8-4        httpuv_1.5.2        munsell_0.5.0      
## [103] viridisLite_0.3.0

11 References


  1. CSTAT thanks Marianne Huebner for the development of this module.